41 research outputs found
Memory and Parallelism Analysis Using a Platform-Independent Approach
Emerging computing architectures such as near-memory computing (NMC) promise improved performance for applications by reducing the data movement between the CPU and memory. However, detecting such applications is not a trivial task. In this ongoing work, we extend the state-of-the-art platform-independent software analysis tool with NMC-related metrics such as memory entropy, spatial locality, data-level parallelism, and basic-block-level parallelism. These metrics help identify the applications that are most suitable for NMC architectures.
Comment: 22nd ACM International Workshop on Software and Compilers for Embedded Systems (SCOPES '19), May 2019
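To make the first two metrics concrete, the sketch below computes memory entropy and a simple spatial-locality score from a raw memory-address trace. The trace format, the 64-byte line granularity, and the helper names are assumptions for illustration only, not the tool's actual implementation.

```python
# Illustrative sketch (assumed trace format): two platform-independent memory metrics.
import math
from collections import Counter

def memory_entropy(addresses, line_bytes=64):
    """Shannon entropy (bits) of the accessed cache-line distribution."""
    lines = [a // line_bytes for a in addresses]
    counts = Counter(lines)
    total = len(lines)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def spatial_locality(addresses, line_bytes=64):
    """Fraction of consecutive accesses that land within one cache line
    of the previous access (a simple stride-based locality proxy)."""
    close = sum(1 for prev, cur in zip(addresses, addresses[1:])
                if abs(cur - prev) < line_bytes)
    return close / max(len(addresses) - 1, 1)

trace = [0x1000, 0x1008, 0x1010, 0x9000, 0x9008, 0x1018]  # toy address trace
print(f"entropy = {memory_entropy(trace):.2f} bits, "
      f"spatial locality = {spatial_locality(trace):.2f}")
```

Low entropy and high spatial locality suggest conventional caches already serve the application well; the opposite pattern points toward NMC suitability.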
PET-to-MLIR: A polyhedral front-end for MLIR
We present PET-to-MLIR, a new tool to enter the MLIR compiler framework from C source. The tool is based on the popular PET and ISL libraries for extracting and manipulating quasi-affine sets and relations, and on Loop Tactics, a declarative optimizer. The use of PET brings advanced diagnostics and full support for C by relying on the Clang parser. ISL allows easy manipulation of the polyhedral representation and efficient code generation. Loop Tactics, on the other hand, enables us to detect computational motifs transparently and lift the entry point into MLIR, thus enabling domain-specific optimizations in general-purpose code. We demonstrate our tool on the Polybench/C benchmark suite and show that it successfully lowers most of the benchmarks to MLIR’s affine dialect. We believe that our tool can benefit research in the compiler community by providing an automatic way to translate C code to the MLIR affine dialect.
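As a conceptual illustration of the quasi-affine property that PET extracts and ISL manipulates, the hypothetical sketch below checks whether an array subscript is affine in the loop induction variables and parameters. This is not how PET or ISL work internally; sympy is used purely for the illustration.

```python
# Hypothetical sketch: is an array subscript affine in the loop variables/parameters?
import sympy

i, j, N = sympy.symbols("i j N", integer=True)

def is_affine(expr, vars_and_params):
    """True if expr has degree <= 1 in the loop variables and parameters."""
    try:
        poly = sympy.Poly(expr, *vars_and_params)
    except Exception:
        return False          # non-polynomial terms cannot be plain affine expressions
    return poly.total_degree() <= 1

print(is_affine(2*i + 3*j + 1, (i, j, N)))    # True:  A[2*i + 3*j + 1]
print(is_affine(i + N, (i, j, N)))            # True:  parameters may appear linearly
print(is_affine(i * j, (i, j, N)))            # False: A[i*j] is not affine
print(is_affine(sympy.Mod(i, 4), (i, j, N)))  # False here; modulo is only quasi-affine
```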
NMPO: Near-Memory Computing Profiling and Offloading
Real-world applications now process big data sets and are often bottlenecked by the data movement between the compute units and the main memory. Near-memory computing (NMC), a modern data-centric computational paradigm, can alleviate these bottlenecks, thereby improving the performance of applications. The lack of available NMC systems makes simulators the primary evaluation tool for performance estimation. However, simulators are usually time-consuming, and methods that reduce this overhead would accelerate the early-stage design process of NMC systems. This work proposes Near-Memory computing Profiling and Offloading (NMPO), a high-level framework capable of predicting NMC offloading suitability using an ensemble machine-learning model. NMPO predicts NMC suitability with an accuracy of 85.6% and, compared to prior works, reduces the prediction time by up to three orders of magnitude by using hardware-dependent application features.
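The sketch below shows the general shape of such an ensemble predictor, assuming a table of per-kernel features with a binary offload label. The feature names, the random-forest choice, and the synthetic data are assumptions for illustration, not NMPO's actual feature set or model.

```python
# Minimal sketch of an ensemble offloading predictor (synthetic data, assumed features).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 3))                 # columns: [llc_miss_rate, bw_util, ipc]
y = (X[:, 0] > 0.6).astype(int)          # toy labelling rule: miss-heavy => offload

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"held-out accuracy: {model.score(X_te, y_te):.2f}")
```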
Near Memory Acceleration on High Resolution Radio Astronomy Imaging
Modern radio telescopes like the Square Kilometre Array (SKA) will need to process exabytes of radio-astronomical signals in real time to construct a high-resolution map of the sky. Near-Memory Computing (NMC) could alleviate the performance bottlenecks caused by frequent memory accesses in a state-of-the-art radio-astronomy imaging algorithm. In this paper, we show that a sub-module performing a two-dimensional fast Fourier transform (2D FFT) is memory bound, using CPI breakdown analysis on an IBM Power9. We then present an NMC approach on FPGA for the 2D FFT that outperforms a CPU by up to 120x and performs comparably to a high-end GPU, while using less bandwidth and memory.
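A back-of-the-envelope roofline view of why a large 2D FFT tends to be memory bound is sketched below. The flop and traffic estimates and the machine peak numbers are illustrative placeholders, not measured Power9 figures from the paper.

```python
# Roofline-style sketch: arithmetic intensity of a 2D FFT vs. machine balance.
import math

def fft2d_intensity(n, bytes_per_sample=16):       # complex double = 16 B
    flops = 5.0 * n * n * math.log2(n * n)          # standard 5*N*log2(N) FFT estimate
    bytes_moved = 4 * n * n * bytes_per_sample      # grid read + written in row and column passes
    return flops / bytes_moved

peak_gflops, peak_gbs = 500.0, 120.0                # placeholder CPU peak compute / bandwidth
n = 8192                                            # one 8192 x 8192 grid
print(f"arithmetic intensity = {fft2d_intensity(n):.1f} flop/byte, "
      f"machine balance = {peak_gflops / peak_gbs:.1f} flop/byte")
# Intensity below the machine balance => the kernel is bandwidth (memory) bound,
# which is what motivates computing the FFT near memory.
```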
TDO-CIM: Transparent Detection and Offloading for Computation In-memory
Computation in-memory is a promising non-von Neumann approach that aims to eliminate the data transfer to and from the memory subsystem. Although many architectures have been proposed, compiler support for such architectures is still lagging behind. In this paper, we close this gap by proposing an end-to-end compilation flow for in-memory computing based on the LLVM compiler infrastructure. Starting from sequential code, our approach automatically detects, optimizes, and offloads kernels suitable for in-memory acceleration. We demonstrate our compiler tool-flow on the PolyBench/C benchmark suite and evaluate the benefits of our proposed in-memory architecture, simulated in gem5, by comparing it with a state-of-the-art von Neumann architecture.
Comment: Full version of the DATE 2020 publication.
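One way to picture the per-kernel offloading decision such a flow has to make is the toy cost model below: offload when the host's data-movement cost outweighs the (typically slower) compute in memory. All constants and kernel figures are invented for illustration and are not taken from the paper or from its gem5 setup.

```python
# Toy offloading cost model (illustrative constants, hypothetical kernels).
from dataclasses import dataclass

@dataclass
class Kernel:
    name: str
    flops: float          # arithmetic work in the kernel
    bytes_moved: float    # data it streams through main memory

def should_offload(k, host_gflops=100.0, host_gbs=25.0, cim_gflops=20.0):
    # Host time: bounded by either compute or off-chip memory traffic.
    host_time = max(k.flops / (host_gflops * 1e9), k.bytes_moved / (host_gbs * 1e9))
    # In-memory time: slower compute, but no off-chip traffic to pay for.
    cim_time = k.flops / (cim_gflops * 1e9)
    return cim_time < host_time

for k in (Kernel("gemm", 2e9, 5e7), Kernel("stream-copy", 1e6, 2e9)):
    print(k.name, "-> offload" if should_offload(k) else "-> keep on the CPU")
```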
Prebypass: Software Register File Bypassing for Reduced Interconnection Architectures
Exposed Datapath Architectures (EDPAs) with aggressively pruned datapath connectivity, where not all function units in the design have connections to a centralized register file, are promising solutions for energy-efficient computation. Bypassing data directly between function units, without temporary copies to the register file, is a prime optimization for programming such architectures. However, traditional compiler frameworks such as LLVM assume that function units connect to register files and allocate all live variables in register files. This leads to schedule inefficiencies in terms of instruction-level parallelism and register accesses on EDPAs. To address these inefficiencies, we propose Prebypass, a new optimization pass for EDPA compiler backends. Experimental results on an EDPA class of architecture, the Transport-Triggered Architecture, show that Prebypass improves runtime, register reads, and register writes by up to 16%, 26%, and 37%, respectively, when the datapath is extremely pruned. Evaluation in a 28-nm FDSOI technology reveals that Prebypass improves core-level energy by 17.5% over the current heuristic scheduler.
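The toy model below captures the core bypassing idea: when a value's producer is scheduled close enough to its consumer, the consumer can read it directly from the producing function unit instead of the register file, eliminating a register read (removing the corresponding write when no other reader remains is a further step not shown). The instruction representation, the one-cycle bypass window, and all names are assumptions for this sketch, not Prebypass's actual implementation.

```python
# Sketch of software register-file bypassing over a linear schedule (assumed IR).
from collections import namedtuple

Op = namedtuple("Op", "cycle fu dst srcs")          # dst/srcs are register names

def prebypass(schedule, bypass_window=1):
    last_def = {}                                   # register -> (cycle, producing FU)
    rewritten, rf_reads_removed = [], 0
    for op in sorted(schedule, key=lambda o: o.cycle):
        new_srcs = []
        for src in op.srcs:
            if src in last_def and op.cycle - last_def[src][0] <= bypass_window:
                new_srcs.append(f"bypass:{last_def[src][1]}")  # read FU output directly
                rf_reads_removed += 1
            else:
                new_srcs.append(src)                           # read the register file
        if op.dst:
            last_def[op.dst] = (op.cycle, op.fu)
        rewritten.append(op._replace(srcs=tuple(new_srcs)))
    return rewritten, rf_reads_removed

sched = [Op(0, "ALU0", "r1", ("r2", "r3")),
         Op(1, "ALU1", "r4", ("r1", "r5")),   # r1 produced one cycle earlier -> bypass
         Op(5, "ALU0", "r6", ("r1", "r4"))]   # producers too far away -> keep RF reads
new_sched, removed = prebypass(sched)
print(removed, "register read(s) turned into bypasses")
```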
Exploring processor parallelism: Estimation methods and optimization strategies
Automatic optimization of application-specific instruction-set processor (ASIP) architectures mostly focuses on the internal memory hierarchy design or on the extension of reduced instruction-set architectures with complex custom operations. This paper focuses on very long instruction word (VLIW) architectures and, more specifically, on automating the selection of an application-specific VLIW issue-width. The issue-width selection strongly influences all the important processor properties (e.g., processing speed, silicon area, and power consumption). Therefore, accurate and efficient issue-width estimation and optimization are among the most important aspects of VLIW ASIP design. In this paper, we first compare different methods for estimating the required issue-width, and subsequently introduce a new force-based parallelism estimation method which is capable of estimating the required issue-width with only 3% error on average. Furthermore, we present and compare two techniques for estimating the required issue-width of software-pipelined loop kernels and show that a simple utilization-based measure provides an error margin of less than 1% on average.
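A minimal utilization-style estimate in this spirit (not the force-based method of the paper) is sketched below: for a loop body given as a data-dependence DAG, the longest dependence chain bounds the schedule length, and the operation count divided by that length bounds the issue-width needed to reach it. The graph encoding is an assumption for the example.

```python
# Utilization-based issue-width lower bound from a data-dependence DAG (sketch).
import math

def required_issue_width(num_ops, deps):
    """deps: (producer, consumer) edges over operation ids 0..num_ops-1 (a DAG)."""
    succs = {i: [] for i in range(num_ops)}
    for p, c in deps:
        succs[p].append(c)

    depth = {}
    def chain_length(i):                    # length of the longest chain starting at i
        if i not in depth:
            depth[i] = 1 + max((chain_length(s) for s in succs[i]), default=0)
        return depth[i]

    critical_path = max(chain_length(i) for i in range(num_ops))
    return math.ceil(num_ops / critical_path)

# Eight independent two-operation chains: 16 ops, critical path of 2,
# so roughly 8 issue slots are needed to reach the minimum schedule length.
edges = [(i, i + 8) for i in range(8)]
print(required_issue_width(16, edges))      # 8
```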
The conventional approach of moving data to the CPU for computation has become a significant performance bottleneck for emerging scale-out data-intensive applications due to their limited data reuse. At the same time, the advancement of 3D integration technologies has made the decade-old concept of coupling compute units close to the memory, called near-memory computing (NMC), more viable. Processing right at the “home” of the data can significantly diminish the data-movement problem of data-intensive applications. In this paper, we survey the prior art on NMC across various dimensions (architecture, applications, tools, etc.) and identify the key challenges and open issues, along with future research directions. We also provide a glimpse of our approach to near-memory computing, which includes i) NMC-specific, microarchitecture-independent application characterization, ii) a compiler framework to offload NMC kernels onto our target NMC platform, and iii) an analytical model to evaluate the potential of NMC.
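As an example of the kind of first-order reasoning such an analytical model involves, the Amdahl-style sketch below projects NMC speedup from the memory-bound fraction of host run time and the internal-to-external bandwidth ratio. The formula and parameter values are illustrative assumptions, not the model from the paper.

```python
# First-order NMC speedup projection (illustrative, Amdahl-style).
def nmc_speedup(mem_fraction, bandwidth_gain):
    """mem_fraction: share of host run time stalled on memory (0..1).
    bandwidth_gain: internal-to-external memory bandwidth ratio of the NMC system."""
    return 1.0 / ((1.0 - mem_fraction) + mem_fraction / bandwidth_gain)

for m in (0.2, 0.5, 0.8):
    print(f"memory-bound fraction {m:.0%}: "
          f"projected NMC speedup {nmc_speedup(m, bandwidth_gain=8):.2f}x")
```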